202602130920 - swebench-verified-vs-pro

Main Topic

Question: What is the difference between SWE-bench Verified and SWE-bench Pro, and how should I read model comparison charts that show both?

SWE-bench is a benchmark where an agent is given a real repository plus an issue description, and must produce a patch that makes newly added fail-to-pass tests pass while keeping existing pass-to-pass tests passing. The score is therefore a measure of end-to-end software issue resolution under a fixed evaluation harness.
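
As a concrete illustration of this scoring rule, here is a minimal sketch of the per-task resolution check (the function and field names are illustrative, not the actual SWE-bench harness API):

```python
def is_resolved(fail_to_pass: dict[str, bool], pass_to_pass: dict[str, bool]) -> bool:
    """Decide whether a single task instance counts as resolved.

    fail_to_pass: test name -> passed after applying the model's patch
                  (these tests failed before the patch, by construction).
    pass_to_pass: test name -> passed after applying the model's patch
                  (these tests already passed before the patch).
    """
    # Every newly added fail-to-pass test must now pass...
    all_fixed = all(fail_to_pass.values())
    # ...and no previously passing test may regress.
    no_regressions = all(pass_to_pass.values())
    return all_fixed and no_regressions


# One fixed test plus one broken existing test means the task is not resolved.
print(is_resolved({"test_new_feature": True}, {"test_old_behavior": False}))  # False
```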

SWE-bench Verified and SWE-bench Pro are two different attempts to make this measurement more reliable, but they target different failure modes.

SWE-bench Verified is a human-validated subset of SWE-bench (500 tasks) designed to reduce false negatives caused by benchmark noise. In the original SWE-bench test set, some tasks are effectively underspecified, have overly specific tests, or suffer from environment brittleness, which can cause correct or reasonable fixes to be marked wrong. Verified was created by having professional software developers screen tasks and keep only those that are well scoped and solvable, alongside improvements to evaluation reliability (a containerized harness). In practical terms, Verified tends to measure how well an agent can solve medium-sized, realistic Python OSS issues when neither the task specification nor the evaluation is working against the solver.

SWE-bench Pro is designed to be a harder, more contamination-resistant benchmark that better reflects long-horizon professional engineering tasks across more diverse, industrially relevant codebases. Scale’s description emphasizes (1) stronger contamination resistance by construction (e.g., copyleft-licensed OSS subsets and private codebases), (2) greater task diversity beyond a small set of common utility libraries, (3) less oversimplification (issues are augmented by humans to be solvable without being trivial), and (4) reproducible environments built and validated by professional engineers. It also tends to require larger patches across more files, which increases the need for planning, codebase navigation, and sustained tool use.

How to interpret comparison charts that show both:

  1. Treat the two bars as measuring different regimes.
    Verified mostly measures whether a model can fix well-posed, single-issue Python OSS bugs; Pro additionally stresses long-horizon, multi-file agentic work in less familiar codebases.

  2. Expect a large performance drop from Verified to Pro.
    If a model (or agent scaffold) drops from around 70 percent on Verified to around 20 percent on Pro, that does not necessarily mean it got worse at coding; it often means Pro is testing additional capabilities that Verified only weakly stresses: multi-file reasoning, sustained execution, better context gathering, and robustness to more realistic repository structure and tooling.

  3. Use the gap as a diagnostic.
    A wide Verified-to-Pro gap points to a bottleneck in agentic competence (planning, codebase navigation, iteration) rather than in raw code generation.

  4. Read the metric definition carefully.
    Both benchmarks typically report a resolve rate (pass at 1), where a task counts as solved only if the patch makes the fail-to-pass tests pass and does not break the pass-to-pass tests (see the sketch after this list). Charts can still be misleading if different leaderboards use different scaffolds, tool budgets, or environment constraints. When comparing models, prefer results produced under a single, consistent scaffold.
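
To make the aggregation concrete, here is a minimal sketch of how the per-benchmark numbers and the gap are computed, assuming per-task resolution flags like those in the earlier sketch (the figures below are illustrative, not taken from any leaderboard):

```python
def resolve_rate(resolved_flags: list[bool]) -> float:
    """Resolve rate (pass at 1): fraction of task instances resolved on the first attempt."""
    return sum(resolved_flags) / len(resolved_flags)


# Hypothetical per-task outcomes for one model under one fixed scaffold.
verified_flags = [True] * 70 + [False] * 30   # roughly 70 percent on Verified
pro_flags = [True] * 20 + [False] * 80        # roughly 20 percent on Pro

verified = resolve_rate(verified_flags)
pro = resolve_rate(pro_flags)

# The gap is the diagnostic: a large drop points at agentic bottlenecks
# (planning, navigation, sustained execution) rather than raw code generation.
print(f"Verified: {verified:.0%}  Pro: {pro:.0%}  Gap: {verified - pro:.0%}")
```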

Practical takeaway: use SWE-bench Verified as a baseline for evaluating whether a model can solve well-posed OSS bugfix tasks, and use SWE-bench Pro to evaluate whether it can operate as a robust software engineering agent in harder, less gameable conditions. A chart that shows both is most useful for understanding where the system’s bottlenecks shift from code generation to end-to-end agent competence.

🌲 Branching Questions

What exactly does “Verified” verify?

Verified is primarily verifying benchmark quality: the issue description is sufficiently specified, the tests reflect the intended fix rather than hidden requirements, and the environment/harness is reliable enough that correct solutions are not rejected for incidental reasons. It is a quality-controlled slice of SWE-bench meant to reduce systematic underestimation caused by noisy tasks.

What makes Pro harder beyond “more tasks”?

Pro is designed to introduce harder and more realistic conditions: broader codebase diversity, longer-horizon tasks, larger multi-file changes, stronger contamination resistance, and human-built reproducible environments with human checkpoints for requirements and test relevance. These factors increase the need for exploration, planning, and iterative debugging rather than single-shot patch writing.

How should I use these benchmarks when choosing a model for my team?

If your workflow resembles quick bug fixes in familiar OSS-style repos, Verified scores correlate more directly with what you care about. If you are building an agent that must navigate unfamiliar repos, apply multi-file refactors, and iterate over failures, Pro is a better stress test. In both cases, consider running a small internal evaluation suite on your own codebase to complement public benchmarks.
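
For that internal evaluation suggestion, here is a minimal sketch of such a check, assuming you already have a candidate patch file and a test command for your own repository (the paths, the git-apply-based flow, and the cleanup step are assumptions, not a prescribed harness):

```python
import subprocess


def evaluate_patch(repo_dir: str, patch_path: str, test_cmd: list[str]) -> bool:
    """Apply a model-generated patch to a clean checkout and rerun the test suite.

    Returns True if the project's tests pass after the patch is applied.
    """
    # Apply the candidate patch to the working tree.
    subprocess.run(["git", "apply", patch_path], cwd=repo_dir, check=True)
    # Run the project's own test command, e.g. ["pytest", "-q"].
    result = subprocess.run(test_cmd, cwd=repo_dir)
    # Roll the working tree back so the next candidate starts from a clean state
    # (files newly added by the patch would also need `git clean -fd`).
    subprocess.run(["git", "checkout", "--", "."], cwd=repo_dir, check=True)
    return result.returncode == 0


# Example usage (hypothetical paths and command):
# ok = evaluate_patch("/tmp/my-repo", "/tmp/candidate.diff", ["pytest", "-q"])
```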

How should I read a bar chart comparing models on both benchmarks?

Read it as two dimensions: the Verified bar shows how well the model fixes well-posed OSS issues when the benchmark is not working against it, and the Pro bar shows how well it holds up as an end-to-end engineering agent on longer-horizon, multi-file tasks in more diverse codebases. A model that is strong on Verified but weak on Pro is most likely limited by agentic competence rather than by code generation.

What are common pitfalls when people cite SWE-bench scores?

The recurring ones, all touched on above: comparing numbers produced under different scaffolds, tool budgets, or environment constraints; conflating the variants (original SWE-bench, Verified, Pro) as if they measured the same thing; and ignoring contamination concerns when the model may have seen the underlying repositories or fixes during training.

References